A primer on partitional and hierarchical clustering algorithms, with applications to financial datasets
```r
library(tidyverse)
library(fpc)
library(factoextra)
library(fungible)
library(ggdendro)
```
In tackling these problems, we use the notions of distance we studied in previous weeks.
There are two main classes of clustering algorithms: partitional and hierarchical.
K-means, the canonical partitional algorithm, attempts to split the samples (rows) of a data matrix X into a predetermined number K of clusters:
Step 1: Initialize a random set of K centroids
Step 2: Assign each sample to one cluster such that the within-cluster variance is minimised
Step 3: Update K centroids based on the clusters from Step 2
Step 4: Repeat steps 2 and 3 until convergence
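The four steps above are implemented in base R's kmeans(). A minimal sketch on illustrative simulated data (the two features and K = 3 are assumptions made for the example):

```r
set.seed(1)
# two features with three loose groups (illustrative data)
dat <- data.frame(
  x = rnorm(45, rep(1:3, each = 15), 0.2),
  y = rnorm(45, rep(c(1, 2, 1), each = 15), 0.2)
)

# Steps 1-4: kmeans() initialises K centroids, assigns each sample to the
# nearest centroid, updates the centroids, and iterates to convergence
km <- kmeans(dat, centers = 3, nstart = 25)

km$cluster        # cluster assignment per sample
km$tot.withinss   # total within-cluster sum of squares
```

Setting nstart > 1 reruns the random initialisation several times and keeps the best solution, which mitigates the sensitivity of Step 1.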
fungible::monte() simulates a set of clusters in which a given proportion of the total variance is due to the mixture structure. We will use the ggdendro package to plot dendrograms.

```r
set.seed(12345)
f1 <- rnorm(45, rep(1:3, each = 15), 0.2)
f2 <- rnorm(45, rep(c(1, 2, 1), each = 15), 0.2)
dat <- tibble(x = f1, y = f2, obs = 1:45)
```
```r
dat %>%
  ggplot(aes(x = x, y = y)) +
  geom_point(colour = 'pink') +
  geom_label(aes(label = obs))
```

The function hclust() takes a distance matrix produced by dist() (the default is the Euclidean distance) computed from the tibble dat, and then derives a linkage matrix using a single-linkage criterion.
Initially, each observation is assigned to its own cluster; the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until only a single cluster remains.
At each stage, distances between clusters are recomputed with the Lance–Williams dissimilarity update formula, according to the particular clustering method being used. See ?hclust for more details.
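The pipeline described above can be sketched end to end; the simulated two-feature data below mirrors the earlier example, and the commented ggdendrogram() call is from the ggdendro package loaded at the top:

```r
set.seed(12345)
dat <- data.frame(
  x = rnorm(45, rep(1:3, each = 15), 0.2),
  y = rnorm(45, rep(c(1, 2, 1), each = 15), 0.2)
)

d  <- dist(dat)                      # Euclidean distances (the default)
hc <- hclust(d, method = "single")   # single-linkage agglomeration

clusters <- cutree(hc, k = 3)        # cut the tree into 3 clusters
# ggdendro::ggdendrogram(hc)         # plot the dendrogram
```

cutree() lets us recover a flat partition from the hierarchy at any chosen number of clusters (or cut height).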
López de Prado (2020) presents the ONC algorithm, which recovers the number of clusters from a shuffled block-diagonal correlation matrix.
ONC belongs to the broader class of algorithms that apply the silhouette method.
Although we typically focus on finding the number of clusters within a correlation matrix, the algorithm can be applied to any generic observation matrix.
Important
The gap statistic is an attempt to automate the "elbow finding" on the WSS curve. It works best when the data come from a mixture of populations that each have an approximately Gaussian distribution (a Gaussian mixture). - @Tibshirani2001
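The gap statistic of Tibshirani et al. is implemented in the cluster package as clusGap(). A sketch on illustrative simulated data (K.max, B, and nstart are example choices, not prescriptions):

```r
library(cluster)

set.seed(123)
x <- cbind(
  rnorm(45, rep(1:3, each = 15), 0.2),
  rnorm(45, rep(c(1, 2, 1), each = 15), 0.2)
)

# compare log(WSS) with its expectation under B uniform reference
# datasets, for K = 1, ..., K.max
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 25)

# suggested K via the one-standard-error rule on the gap curve
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```

factoextra::fviz_gap_stat(gap) can then visualise the gap curve with its error bars.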
Silhouette scores are defined for each sample as:

$$
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}
$$

where $a_i$ is the average distance from sample $i$ to the other members of its own cluster, and $b_i$ is the average distance from sample $i$ to the members of the nearest other cluster.
#### Advantages

* The scores are bounded in $[-1, 1]$.
* Because we have one score per sample, we can reallocate specific observations to better clusters.
* Clusters whose average silhouette is close to zero are overlapping, and could be merged.
* We can use the scores to derive a distribution, and make inferences (p-values). For example, we can compute the t-value $t = \bar{s}/\sigma_s$ as a measure of clustering quality.
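cluster::silhouette() returns one score per sample, from which the t-value style quality measure can be computed. A sketch on illustrative simulated data (the data and K = 3 are assumptions made for the example):

```r
library(cluster)

set.seed(123)
x <- cbind(
  rnorm(45, rep(1:3, each = 15), 0.2),
  rnorm(45, rep(c(1, 2, 1), each = 15), 0.2)
)

km  <- kmeans(x, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(x))   # one score per sample

s <- sil[, "sil_width"]
mean(s) / sd(s)    # t-value style measure of clustering quality
```

summary(sil) reports the average silhouette width per cluster, which is useful for spotting the overlapping clusters mentioned above.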
```
Rows: 7,389
Columns: 7
$ date <date> 1988-10-03, 1988-10-04, 1988-10-05, 1988-10-06, 1988-10-07, 1988…
$ rm   <dbl> -0.0108511, 0.0021296, 0.0103932, 0.0063450, 0.0036179, 0.0001750…
$ rf   <dbl> 0.0004220, 0.0004233, 0.0004251, 0.0004254, 0.0004254, 0.0004223,…
$ rmrf <dbl> -0.01127306, 0.00170627, 0.00996818, 0.00591960, 0.00319251, -0.0…
$ smb  <dbl> 0.00525148, -0.00085576, -0.00446347, -0.00249881, 0.00030991, 0.…
$ hml  <dbl> 0.00068780, -0.00139754, 0.00140606, 0.00250139, -0.00277563, -0.…
$ umd  <dbl> -0.00508378, 0.00121570, -0.00373774, -0.00266541, 0.00224610, 0.…
```
The clustering of correlation matrices is peculiar in that the features coincide with the observations: we group observations that are themselves the features (hence the symmetry of X).
Matrix X appears to be a distance matrix, but it is not. It is still an observations matrix, on which distances can be evaluated.
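One standard way to evaluate such a distance is the mapping $d_{ij} = \sqrt{(1 - \rho_{ij})/2}$, which sends perfectly correlated series to 0 and perfectly anti-correlated series to 1. A sketch using the factor column names from the data above (the simulated returns below are an illustrative stand-in, not the actual Fama–French data):

```r
set.seed(1)
# illustrative stand-in for the daily factor returns
ff <- data.frame(
  rmrf = rnorm(500, 0, 0.010),
  smb  = rnorm(500, 0, 0.005),
  hml  = rnorm(500, 0, 0.005),
  umd  = rnorm(500, 0, 0.006)
)

rho <- cor(ff)               # symmetric observation matrix X
d   <- sqrt((1 - rho) / 2)   # correlation -> distance

# hierarchical clustering of the factors themselves
hc <- hclust(as.dist(d), method = "single")
```

Note that as.dist() is needed because hclust() expects a "dist" object rather than a plain symmetric matrix.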
For a large matrix X, it is generally good practice to reduce its dimension via PCA.
The idea is to replace X with its standardized orthogonal projection onto a lower-dimensional space, where the number of dimensions is given by the number of eigenvalues in X’s correlation matrix that exceed
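A sketch of that projection with base R's prcomp(); the eigenvalue threshold is left unspecified above, so the cutoff of 1 below (eigenvalues above the average, for a correlation matrix) is an assumption made for the example:

```r
set.seed(123)
X  <- matrix(rnorm(500 * 5), ncol = 5)   # illustrative observation matrix
Xs <- scale(X)                           # standardise the columns

pca   <- prcomp(Xs)        # PCA via SVD of the standardised matrix
evals <- pca$sdev^2        # eigenvalues of X's correlation matrix

# ASSUMPTION: keep components whose eigenvalue exceeds 1
k     <- max(1, sum(evals > 1))
X_red <- scale(pca$x[, 1:k, drop = FALSE])   # standardised projection
```

The clustering algorithm is then run on X_red instead of X.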
AI and Trading